AITopics | evaluation method

Collaborating Authors

evaluation method

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Robustness of Refugee-Matching Gains to Off-Policy Evaluation Choices

Bansak, Kirk, Paulson, Elisabeth, Rothenhäusler, Dominik, Ferwerda, Jeremy, Hainmueller, Jens, Hotard, Michael

arXiv.org Machine LearningMay-11-2026

Previous research has investigated the potential of refugee matching for boosting refugee outcomes, first considered by Bansak et al. (2018). This paper demonstrates the stability of counterfactual impact evaluation results in the context of refugee matching in the United States using a range of off-policy evaluation methods. In order to estimate counterfactual impact and test the robustness of our results, we employ several evaluation methods, including inverse probability weighting (IPW) and multiple variants of augmented inverse probability weighting (AIPW). We also consider various modifications, including alternative modeling architectures and different assignment procedures. The impact estimates remain consistent in magnitude in all scenarios as well as statistically significant in most cases. Furthermore, the estimates are also consistent with the results originally presented in Bansak et al. (2018).

artificial intelligence, assignment, machine learning, (14 more...)

arXiv.org Machine Learning

2605.06686

Country: North America > United States (0.49)

Genre: Research Report > New Finding (0.66)

Industry:

Government > Immigration & Customs (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.90)
Government > Regional Government (0.90)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Evaluating Post-hoc Explanations for Graph Neural Networks via Robustness Analysis

Neural Information Processing SystemsApr-30-2026, 02:54:31 GMT

This work studies the evaluation of explaining graph neural networks (GNNs), which is crucial to the credibility of post-hoc explainability in practical usage. Conventional evaluation metrics, and even explanation methods -- which mainly follow the paradigm of feeding the explanatory subgraph to the model and measuring output difference -- mostly suffer from the notorious out-of-distribution (OOD) issue. Hence, in this work, we endeavor to confront this issue by introducing a novel evaluation metric, termed OOD-resistant Adversarial Robustness (OAR). Specifically, we draw inspiration from adversarial robustness and evaluate post-hoc explanation subgraphs by calculating their robustness under attack. On top of that, an elaborate OOD reweighting block is inserted into the pipeline to confine the evaluation process to the original data distribution. For applications involving large datasets, we further devise a Simplified version of OAR (SimOAR), which achieves a significant improvement in computational efficiency at the cost of a small amount of performance.

artificial intelligence, explanation, machine learning, (16 more...)

Neural Information Processing Systems

Country:

Europe (0.93)
North America > United States (0.46)
North America > Canada (0.28)
Asia > China (0.28)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Elo Uncovered: Robustness and Best Practices in Language Model Evaluation

Neural Information Processing SystemsMar-22-2026, 08:24:59 GMT

In Natural Language Processing (NLP), the Elo rating system, originally designed for ranking players in dynamic games such as chess, is increasingly being used to evaluate Large Language Models (LLMs) through A vs B paired comparisons.However, while popular, the system's suitability for assessing entities with constant skill levels, such as LLMs, remains relatively unexplored. We study two fundamental axioms that evaluation methods should adhere to: reliability and transitivity. We conduct an extensive evaluation of Elo behavior across simulated and real-world scenarios, demonstrating that individual Elo computations can exhibit significant volatility.We show that both axioms are not always satisfied, raising questions about the reliability of current comparative evaluations of LLMs.If the current use of Elo scores is intended to substitute the costly head-to-head comparison of LLMs, it is crucial to ensure the ranking is as robust as possible.Guided by the axioms, our findings offer concrete guidelines for enhancing the reliability of LLM evaluation methods, suggesting a need for reassessment of existing comparative approaches.

artificial intelligence, large language model, natural language, (7 more...)

Neural Information Processing Systems

Industry: Leisure & Entertainment > Games > Chess (0.60)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

e2e06adf560b0706d3b1ddfca9f29756-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing SystemsFeb-18-2026, 11:05:54 GMT

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (1.00)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Information Technology > Security & Privacy (1.00)
Government (1.00)
(2 more...)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

e15c4afff22f12c4986c1fcb4e941e03-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing SystemsFeb-18-2026, 10:42:43 GMT

bioinformatics, large language model, machine learning, (20 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Israel (0.04)
Asia > South Korea > Seoul > Seoul (0.04)

Genre: Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Health Care Technology > Medical Record (1.00)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Biomedical Informatics (0.93)

Add feedback

e55c2f3fdde519014c879aa3554414c0-Paper-Conference.pdf

Neural Information Processing SystemsFeb-17-2026, 16:09:43 GMT

data mining, explanation, machine learning, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Europe > Austria (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
(7 more...)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Data Science > Data Mining (0.94)

Add feedback

ZSC-Eval: An Evaluation Toolkit and Benchmark for Multi-agent Zero-shot Coordination Xihuai Wang

Neural Information Processing SystemsFeb-13-2026, 12:55:22 GMT

The significant difference between the deployment-time partners' distribution and the training partners' distribution determined by the

artificial intelligence, machine learning, reinforcement learning, (17 more...)

Neural Information Processing Systems

Country:

Asia > China > Shanghai > Shanghai (0.04)
Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
Europe > Portugal (0.04)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.46)

Industry: Leisure & Entertainment > Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.68)

Add feedback

On GANs and GMMs

Eitan Richardson, Yair Weiss

Neural Information Processing SystemsFeb-12-2026, 02:41:53 GMT

Neural Information Processing Systems http://nips.cc/

dataset, gan, international conference, (11 more...)

Neural Information Processing Systems

Country:

North America > Canada > Ontario > Toronto (0.14)
Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.05)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report (0.46)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

RobustDeepReinforcementLearning throughAdversarialLoss

Neural Information Processing SystemsFeb-11-2026, 11:35:22 GMT

Our RADIAL-RL agents consistently outperform prior methods when tested against attacks of varying strength and are more computationally efficient to train. In addition, we propose a new evaluation method calledGreedyWorst-Case Reward(GWC) tomeasure attack agnostic robustness of deep RL agents. We show that GWC can be evaluated efficiently and is a good estimate of the reward under the worst possible sequence of adversarial attacks.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

An Information-Theoretic Evaluation of Generative Models in Learning Multi-modal Distributions

Neural Information Processing SystemsDec-24-2025, 04:17:40 GMT

The evaluation of generative models has received significant attention in the machine learning community. When applied to a multi-modal distribution which is common among image datasets, an intuitive evaluation criterion is the number of modes captured by the generative model. While several scores have been proposed to evaluate the quality and diversity of a model's generated data, the correspondence between existing scores and the number of modes in the distribution is unclear. In this work, we propose an information-theoretic diversity evaluation method for multi-modal underlying distributions. We utilize the R\'enyi Kernel Entropy (RKE) as an evaluation score based on quantum information theory to measure the number of modes in generated samples.

generative model, information-theoretic evaluation, rke score, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.76)

Add feedback